WORDS Version 1.0 - A fast word extractor program. 3/30/90
Purpose of WORDS
----------------
WORDS extracts a list of unique "words" from an input file, or
several input files, and writes them to an output file, one per
line. The program recognizes a number of options for:
o Set operations on multiple files
o Case sensitivity
o High-order bit stripping
o Alphabetic output sort
o Defining the characters comprising a "word"
How to run WORDS
----------------
From the DOS command line enter:
WORDS filenames [-U/-I/-C] [-A[+/-]] [-L[+/-]] [-H[+/-]]
[-W[+/-]abc..] [-Oname] [@name]
Spaces delimit command line parameters. You may intermingle
input text filenames and options (mark each option with a leading
hyphen). Some options (-W,-O) allow a character string or
filename to follow the option letter. This must follow with no
intervening spaces or the program will mistake it for an input
file name. Some options (-A,-L,-H) allow a "+" or "-" to
indicate "on" or "off". This also must follow with no
intervening space, and "+" is assumed if it is omitted. You may
place options and filenames in an ASCII "include" file and
specify its name with a leading "@" on the command line. An
include file may contain references to other include files. You
also may specify default options, filenames and include files in
the DOS environment using "SET WORDS=...". For example:
SET WORDS=-U -A+ -L+ -Owords.out -W-ABCDEFGHIJKLMNOPQRSTUVWXYZ
SET WORDS=@defaults.wrd -O
WORDS processes options left-to-right, first from the DOS
environment, then from the command line. Where options conflict,
the last option processed prevails. Thus, you may override "SET"
environment options on the command line.
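
The left-to-right, last-one-wins rule can be illustrated with a short Python sketch. This is not the author's code (WORDS itself is written in Turbo Pascal); the function name resolve_options and the settings dictionary are hypothetical, and the sketch deliberately ignores -W strings and does not expand @include files.

```python
# Hypothetical sketch of WORDS-style option resolution.
# Assumptions: names and data layout are invented for illustration;
# -W options and @include files are skipped rather than processed.
def resolve_options(env_value, argv):
    """Return (settings, input_files); the last option processed prevails."""
    settings = {"set_op": "U", "A": False, "L": False, "H": False, "output": ""}
    files = []
    # Tokens from the environment string come first, then the command line.
    for token in env_value.split() + list(argv):
        if token.startswith("@"):
            continue  # include files are not expanded in this sketch
        if not token.startswith("-"):
            files.append(token)  # no leading hyphen: an input filename
            continue
        letter, rest = token[1:2].upper(), token[2:]
        if letter in ("U", "I", "C"):
            settings["set_op"] = letter      # only one set operation is active
        elif letter in ("A", "L", "H"):
            settings[letter] = rest != "-"   # "+" is assumed when omitted
        elif letter == "O":
            settings["output"] = rest        # empty name means StdOut
    return settings, files


if __name__ == "__main__":
    print(resolve_options("-U -A+ -Owords.out", ["letter.doc", "-I", "-A-"]))
```

Here the command line's -I and -A- override the -U and -A+ taken from the environment, while -Owords.out survives because nothing later replaces it.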
What the options mean
---------------------
-U, -I or -C specifies the set operation to be performed on the
extracted words from the input files. Only one of these options
is active for any given WORDS run. The operations are:
-U Union: Keep all unique words from any input file.
This is the default.
-I Intersection: Keep unique words common to all input
files.
-C Complement: Keep unique words from the second and
subsequent files, only if they are NOT contained in
the first file.
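
As a rough illustration of the three set operations, here is a Python sketch. The function name combine and the list-of-lists input are assumptions made for the example, not a description of how WORDS stores its data. Output order follows first encounter, as it would with -A off.

```python
# Hypothetical sketch of the -U, -I and -C set operations.
def combine(word_lists, op="U"):
    """word_lists holds one list of words per input file, in command-line order."""
    if not word_lists:
        return []
    first = set(word_lists[0])
    if op == "I":
        # Intersection: the word must appear in every input file.
        common = first.intersection(*map(set, word_lists[1:]))
        keep = lambda w: w in common
        sources = word_lists
    elif op == "C":
        # Complement: words from the second and later files, not in the first.
        keep = lambda w: w not in first
        sources = word_lists[1:]
    else:
        # Union (default): every unique word from any file.
        keep = lambda w: True
        sources = word_lists
    seen, result = set(), []
    for words in sources:
        for w in words:
            if keep(w) and w not in seen:
                seen.add(w)
                result.append(w)
    return result


if __name__ == "__main__":
    a = ["RED", "GREEN", "BLUE"]
    b = ["BLUE", "BLACK", "RED"]
    for op in "UIC":
        print(op, combine([a, b], op))
```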
Other options:
-A[+/-] Sort output words alphabetically (default off). If
-A is off, output words will be in order of first
encounter in the input files.
-H[+/-] Clear the high-order bit on each input character
(default off). Use this option to process files
created by word processing programs, like WordStar,
that mark some letters by setting the high-order
bit, often at the beginning or end of a word.
-L[+/-] Lower case is significant (default off). If -L is
off, the program will shift all output words to
upper case.
-W-abc.. Replace the "word character set" with the indicated
characters. The program checks each character in
each input file for membership in the word character
set and defines a "word" as an uninterrupted
sequence of at least one but no more than 35
characters which are members of that set. The
default is the set of upper and lower case
alphabetic characters.
-W+abc.. Add additional characters to the word character set.
-O[name] Name the output file. If the name is omitted ("-O "),
output goes to "StdOut" and is available for a DOS
pipe (|) or redirection (>). StdOut is the
default.
-O- Suppress output. -Onul also suppresses output. The
program will still display word counts on the
screen.
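
The way -H, -L and -W interact can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's actual algorithm: the name extract_words is hypothetical, a run longer than 35 characters is assumed to be split into 35-character pieces (the documentation does not say what WORDS does with over-long runs), and with -L off, words differing only in case are assumed to count as one word.

```python
import string

MAX_WORD = 35  # documented maximum word length


def extract_words(data, word_chars=string.ascii_letters,
                  strip_high=False, keep_lower=False):
    """Return unique words from the bytes in `data`, in order of first encounter."""
    members = set(word_chars.encode("latin-1"))
    seen, result, current = set(), [], bytearray()

    def flush():
        # Emit the run collected so far, if any.
        if current:
            word = current.decode("latin-1")
            if not keep_lower:
                word = word.upper()  # -L off: shift output to upper case
            if word not in seen:
                seen.add(word)
                result.append(word)
            current.clear()

    for byte in data:
        if strip_high:
            byte &= 0x7F  # -H on: clear the high-order bit
        if byte in members:
            current.append(byte)
            if len(current) == MAX_WORD:
                flush()  # assumption: over-long runs are split, not dropped
        else:
            flush()  # a non-member character ends the word
    flush()
    return result


if __name__ == "__main__":
    print(extract_words(b"The cat saw the Cat."))
    print(extract_words(b"get_item2(x)",
                        word_chars=string.ascii_letters + "_0123456789"))
```

The second call mirrors "-W+_0123456789": adding the underline and digits keeps an identifier such as get_item2 together as one word.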
Three examples
--------------
1. Generate an alphabetized list of all words appearing in the
document named WORDS.DOC and write the list to file WORDS.LST.
The following are equivalent:
WORDS words.doc -U -A -Owords.lst
WORDS words.doc -A >words.lst (defaults: -U, StdOut)
WORDS -U words.doc -A+ >words.lst
SET WORDS=-A+ -Owords.lst (set defaults)
WORDS words.doc
2. Given a previously extracted list of words in file SPELL.CHK,
generate a list of words from file LETTER.DOC which are NOT in
SPELL.CHK, and write the list to LETTER.BWD.
WORDS -C spell.chk letter.doc -Oletter.bwd
(A poor person's spelling checker?)
3. Given file PASCAL.PRC containing a list of Pascal library
procedure names, determine which procedures are referenced by
Pascal source program BIGPROG.PAS and write the referenced
procedure names to the screen with a pause at each screen full.
WORDS pascal.prc bigprog.pas -I -W+_0123456789 | more
(Pascal identifiers may contain numerics or the "_" (underline)
character, so we add these to the word set via "-W+_01..".)
Limitations
-----------
A "word" may be no longer than 35 characters. No more than
65,535 unique words may accumulate. All data must fit in main
memory (this usually determines the limit). Each unique word
occupies its length in memory, plus an overhead of about 7 bytes,
plus another 10 bytes at output time if output is alphabetized
(-A+). Thus, with a typical available memory of 500k and 5
characters (average) per word, you will run out of memory after
about 40,000 unique words, unsorted.
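
The arithmetic behind that estimate can be written out as a small Python sketch. The function name max_unique_words is hypothetical, the overhead figures are the approximate ones quoted above, and "500k" is taken here as 500,000 bytes, so treat the result as a ballpark only.

```python
# Rough capacity estimate from the figures in this section.
WORD_LIMIT = 65535      # documented cap on unique words
BASE_OVERHEAD = 7       # approximate bytes of overhead per word
SORT_OVERHEAD = 10      # extra bytes per word when output is alphabetized


def max_unique_words(free_bytes, avg_len, sorted_output=False):
    """Estimate how many unique words fit in free_bytes of memory."""
    per_word = avg_len + BASE_OVERHEAD + (SORT_OVERHEAD if sorted_output else 0)
    return min(free_bytes // per_word, WORD_LIMIT)


if __name__ == "__main__":
    print(max_unique_words(500_000, 5))                      # unsorted
    print(max_unique_words(500_000, 5, sorted_output=True))  # with -A+
```

With 5-character words each entry costs about 12 bytes unsorted, which gives a figure a little over 40,000; with -A+ the cost rises to about 22 bytes per word and the estimate drops to roughly 22,000.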
FYI, network users: WORDS opens its input files in "Read, Deny
None" mode, @include files in "Read, Compatibility" mode, and the
output file in "Write, Compatibility" mode. Only one file at a
time is open, except during processing of nested @include files.
Legal Stuff
-----------
WORDS.EXE and WORDS.DOC are:
Copyright 1990 by Edwin T. Floyd,
All rights reserved.
WORDS is copyrighted "free" software. The author hereby
expressly permits and encourages individuals to use WORDS at home
and at work and to distribute it without charge. The author
prohibits distribution of WORDS for profit, or as a part of a
product sold for profit, except where explicit written permission
has been obtained from the author for such distribution. Also,
users groups and shareware libraries charging a disk duplication
fee not exceeding $10.00 may distribute WORDS.
The author makes no warranties of any kind, either expressed or
implied, as to merchantability or fitness for any particular
purpose. WORDS.EXE and WORDS.DOC are available as is and in no
event will the author be held liable for damages, including any
lost profits or incidental or consequential damages, even if the
author has been advised of the possibility of such damages.
Authorship
----------
WORDS was written in Turbo Pascal v5.5 by:
Edwin T. Floyd [76067,747] (CompuServe)
#9 Adams Park Court 404/576-3305 (work)
Columbus, GA 31909 404/322-0076 (home)
The latest version of WORDS is available on CompuServe in the
IBMPRO forum, and on a number of bulletin boards around the
country. If you are a Pascal programmer interested in the
technology used to write this program, please drop me a line.
- Edwin - 3-30-90